Abstract: Recently Kutin and Niyogi investigated several notions
of algorithmic stability, a property of a learning map conceptually
similar to continuity, showing that training-stability is sufficient
for consistency of Empirical Risk Minimization (ERM), while
distribution-free CV-stability is necessary and sufficient for
having finite VC-dimension. This paper concerns a phase transition
in the training stability of ERM, conjectured by the same authors.
Kutin and Niyogi proved that ERM on finite hypothesis spaces
containing a unique risk minimizer has training stability that
scales exponentially with sample size, and conjectured that the
existence of multiple risk minimizers prevents even super-quadratic
convergence, that is, any rate $\delta = o(m^{-1/2})$ in the sample
size $m$. We prove this result for the strictly weaker notion of
CV-stability, positively resolving the conjecture.

Index Terms: empirical risk minimization, algorithmic stability,
threshold phenomena



I. Introduction 

Devroye and Wagner [3] first studied the effect of algorithmic
stability on the statistical generalization of learning. Since then
many authors have proposed alternate notions of stability and have
used them to study the performance of algorithms not easily analyzed
with other techniques [1], [2], [4]-[6]. While the learnability of a
concept class can now be characterized by the admission of a stable
learner, finding natural definitions of stability with such
properties is still open [5], [6]. As part of their general
investigation into algorithmic stability in [5], Kutin and Niyogi
observed that Empirical Risk Minimization (ERM) on hypothesis spaces
of cardinality two experiences a phase transition in achievable
rates of stability as the number of risk minimizers increases from
one to two. Under a unique minimizer, stability scales exponentially
with sample size $m$, while in the presence of multiple minimizers
no rate faster than $m^{-1/2}$ is possible. Kutin and Niyogi extended
the unique risk minimizer result to finite spaces, and conjectured
that ERM on a finite hypothesis space containing multiple risk
minimizers is $(0, \delta)$-training-stable for no
$\delta = o(m^{-1/2})$ [5, Conjecture 10.11]. We positively resolve
this conjecture for the weaker CV-stability and show furthermore
that ERM on such spaces is $(0, \delta)$-CV-stable for
$\delta = O(m^{-1/2})$.

Studying the effects of multiple risk minimizers is well 
motivated by the practical consideration of feature selection. 
Learning from irrelevant features can lead to multiple risk
minimizers, for example when learning in a symmetric hypothesis
class with positive minimum risk. Results showing
poor generalization in such cases, as implied by slow rates of 
stability, help to justify the use of feature selection. It is also 
noteworthy that restricting focus to finite hypothesis classes 
is far from unreasonable — practitioners often attempt to learn 
on extracted features, which frequently involves discretization, 
and when learning on a finite domain any hypothesis space 
becomes effectively finite. 

II. Preliminaries 

We follow the setting of [5] closely. $\mathcal{X}$ denotes the input
space, $\mathcal{Y} = \{-1, 1\}$ the output space and
$\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ their product example
space. $D$ is a distribution on $\mathcal{Z}$. Unless stated otherwise
we assume that $m$ examples are drawn i.i.d. according to $D$. A
classifier or hypothesis is a function
$h : \mathcal{X} \to \{-1, 1\}$; $\mathcal{H}$ denotes a set of
classifiers or a hypothesis space. Whenever we refer to a finite
$\mathcal{H}$ we assume, without loss of generality, that no
$h_1, h_2 \in \mathcal{H}$ exist such that $h_1 \neq h_2$ and
$h_1(X) = h_2(X)$ a.s.

A learning algorithm on hypothesis space $\mathcal{H}$ is a function
$A : \bigcup_{m > 0} \mathcal{Z}^m \to \mathcal{H}$. When $A$ is
understood from the context, we write $f_s = A(s)$ for
$s \in \mathcal{Z}^m$. The loss incurred by classifier $h$ on example
$z \in \mathcal{Z}$ is denoted by $\ell(h, z)$. We will assume the 0-1
loss throughout, where $\ell(h, (x, y)) = \mathbf{1}[h(x) \neq y]$.
Define the risk of $h$ with respect to $D$ as
$R_D(h) = \mathbb{E}_{Z \sim D}[\ell(h, Z)]$; and with a slight abuse
of notation the empirical risk as
$R_s(h) = m^{-1} \sum_{i=1}^{m} \ell(h, z_i)$ for
$s = (z_1, \ldots, z_m) \in \mathcal{Z}^m$. The set of risk minimizers
of space $\mathcal{H}$ with respect to distribution $D$ is
$\mathcal{H}^* = \arg\min_{h \in \mathcal{H}} R_D(h)$. A learning
algorithm $A$ is said to implement Empirical Risk Minimization if
$A(s) \in \arg\min_{h \in \mathcal{H}} R_s(h)$ for
$s \in \mathcal{Z}^m$.
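
As an illustration of the definitions above, the following Python
sketch computes the empirical risk under the 0-1 loss and performs ERM
over a finite list of classifiers. The names `zero_one_loss`,
`empirical_risk` and `erm`, the representation of classifiers as Python
callables, and the rule of breaking ties by lowest index are assumptions
made for this example; the definition of ERM leaves the tie-breaking
rule open.

```python
from typing import Callable, Hashable, Sequence, Tuple

Example = Tuple[Hashable, int]          # z = (x, y) with y in {-1, +1}
Classifier = Callable[[Hashable], int]  # h : X -> {-1, +1}


def zero_one_loss(h: Classifier, z: Example) -> int:
    """0-1 loss: 1 if h misclassifies z = (x, y), else 0."""
    x, y = z
    return int(h(x) != y)


def empirical_risk(h: Classifier, s: Sequence[Example]) -> float:
    """R_s(h): average 0-1 loss of h over the sample s."""
    return sum(zero_one_loss(h, z) for z in s) / len(s)


def erm(hypotheses: Sequence[Classifier], s: Sequence[Example]) -> int:
    """Index of an empirical risk minimizer (ties: lowest index, an assumption)."""
    return min(range(len(hypotheses)),
               key=lambda j: empirical_risk(hypotheses[j], s))


# Two constant classifiers on a toy sample (hypothetical data).
H = [lambda x: 1, lambda x: -1]
sample = [(0, 1), (1, -1), (2, -1)]
print(erm(H, sample))  # index of the constant -1 classifier
```

Returning an index rather than the classifier itself makes it easy to
compare the outputs of ERM on two different samples.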

Based on sequence $s = (z_1, \ldots, z_m) \in \mathcal{Z}^m$, index
$i \in [m]$ and example $u \in \mathcal{Z}$, we define sequences
$s^i = (z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m) \in \mathcal{Z}^{m-1}$
and
$s^{i,u} = (z_1, \ldots, z_{i-1}, u, z_{i+1}, \ldots, z_m) \in \mathcal{Z}^m$.
The following definitions capture the relevant notions of stability.

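Definition 2.1: A learning algorithm $A$ is weakly
$(\beta, \delta)$-hypothesis-stable or has weak hypothesis stability
$(\beta, \delta)$ if for any $i \in [m]$,

\[
\Pr_{(S,U) \sim D^{m+1}} \left( \max_{z \in \mathcal{Z}} \left| \ell(f_S, z) - \ell(f_{S^{i,U}}, z) \right| \le \beta \right) \ge 1 - \delta .
\]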

The notion of weak hypothesis stability was first used by Devroye
and Wagner in [3] under the name of stability. Kearns and Ron [4]
then used the term hypothesis stability for the same concept.

Definition 2.2: A learning algorithm $A$ is
$(\beta, \delta)$-cross-validation-stable, or
$(\beta, \delta)$-CV-stable, if, for any $i \in [m]$,

\[
\Pr_{(S,U) \sim D^{m+1}} \left( \left| \ell(f_S, U) - \ell(f_{S^{i,U}}, U) \right| \le \beta \right) \ge 1 - \delta .
\]
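
The quantity inside Definition 2.2 can be evaluated on a concrete
sample. The sketch below builds $s^{i,u}$ and computes
$|\ell(f_s, u) - \ell(f_{s^{i,u}}, u)|$ for a given learner. The helper
names (`replace_at`, `cv_gap`, `majority_learner`), the 0-based index
`i`, and the rule that ties go to the label $+1$ are assumptions made
for illustration only.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Example = Tuple[Hashable, int]            # z = (x, y) with y in {-1, +1}
Classifier = Callable[[Hashable], int]
Learner = Callable[[Sequence[Example]], Classifier]


def zero_one_loss(h: Classifier, z: Example) -> int:
    x, y = z
    return int(h(x) != y)


def replace_at(s: Sequence[Example], i: int, u: Example) -> List[Example]:
    """s^{i,u}: copy of s with the example at (0-based) position i replaced by u."""
    return list(s[:i]) + [u] + list(s[i + 1:])


def cv_gap(learn: Learner, s: Sequence[Example], i: int, u: Example) -> int:
    """|loss(f_s, u) - loss(f_{s^{i,u}}, u)| for the learner `learn`."""
    f_s = learn(s)
    f_siu = learn(replace_at(s, i, u))
    return abs(zero_one_loss(f_s, u) - zero_one_loss(f_siu, u))


def majority_learner(s: Sequence[Example]) -> Classifier:
    """Constant classifier predicting the majority label (ties -> +1, an assumption)."""
    votes = sum(y for _, y in s)
    label = 1 if votes >= 0 else -1
    return lambda x: label


s = [("a", 1), ("b", -1)]
print(cv_gap(majority_learner, s, 0, ("c", -1)))  # the replacement flips the majority
print(cv_gap(majority_learner, s, 0, ("c", 1)))   # the majority label is unchanged
```

With $\beta = 0$, a learner is $(0, \delta)$-CV-stable exactly when this
gap is nonzero with probability at most $\delta$ over the draw of
$(S, U)$.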

Definition 2.3: A learning algorithm $A$ is
$(\beta, \delta)$-overlap-stable, or has overlap stability
$(\beta, \delta)$, if, for any $i \in [m]$,

\[
\Pr_{(S,U) \sim D^{m+1}} \left( \left| R_{S^i}(f_S) - R_{S^i}(f_{S^{i,U}}) \right| \le \beta \right) \ge 1 - \delta .
\]

Combining CV and overlap stability, Kutin and Niyogi 
arrive at the next definition. 

Definition 2.4: A learning algorithm $A$ is
$(\beta, \delta)$-training-stable, or has training stability
$(\beta, \delta)$, if $A$ has

i. CV stability $(\beta, \delta)$; and

ii. Overlap stability $(\beta, \delta)$.

Trivially, weak hypothesis stability $(\beta, \delta)$ implies training
stability $(\beta, \delta)$. Kutin and Niyogi show that training
stability is sufficient for good bounds on generalization error (for
ERM, CV stability is sufficient). They also consider the stability of
ERM on a two-classifier hypothesis class [5, Theorems 9.2 and 9.5],
showing that the achievable rate on $\delta$ undergoes a phase
transition as the number of risk minimizers increases from one to two.

Theorem 2.5: Consider the class $\mathcal{H}$ consisting of the two
constant classifiers mapping $\mathcal{X}$ to $-1$ and $1$
respectively, and let $A$ implement ERM on $\mathcal{H}$ (i.e., $A$
outputs a majority label of training set $S$). Let
$p = \Pr_{(X,Y) \sim D}(Y = 1)$.

1. If p > 

(2ttto)- 

2. If p > \ then A is weakly (0, <5)-hypothesis-stable for 



1 then A is (0, 5) -training-stable for 5(m) 

1/2 



exp 



p 



m/8 + 0(l) 

3. If $p = \frac{1}{2}$ then $A$ is not $(0, \delta)$-CV-stable, or
$(\beta, \delta)$-overlap-stable for any $\beta < 1$ and any
$\delta = o(m^{-1/2})$.
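
For this two-classifier example the probability that replacing one
training example changes the ERM output can be computed exactly, which
gives a numerical feel for the $m^{-1/2}$ threshold in item 3. The
sketch below assumes that the majority rule breaks ties in favour of
the label $1$ (the theorem does not fix a tie-breaking rule) and uses
hypothetical function names. Under that assumption the output changes
exactly when $Z_i$ and $U$ carry different labels and the other $m - 1$
examples contain $\lceil m/2 \rceil - 1$ labels equal to $1$. Since the
two constant classifiers disagree everywhere, this is also the
probability that the losses on $U$ differ.

```python
import math


def binom_pmf(n: int, k: int, q: float) -> float:
    """Pr(Bin(n, q) = k), evaluated in log space to avoid under/overflow."""
    if k < 0 or k > n:
        return 0.0
    if q in (0.0, 1.0):
        return 1.0 if k == (n if q == 1.0 else 0) else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(q) + (n - k) * math.log(1.0 - q))
    return math.exp(log_pmf)


def switch_probability(m: int, p: float) -> float:
    """Pr(f_S != f_{S^{i,U}}) for majority vote with ties going to label +1."""
    k = (m + 1) // 2 - 1            # ceil(m / 2) - 1 labels equal to +1 among the rest
    return 2.0 * p * (1.0 - p) * binom_pmf(m - 1, k, p)


# Rescale by sqrt(m): roughly constant for p = 1/2, vanishing for p = 0.7.
for m in (11, 101, 1001):
    print(m,
          math.sqrt(m) * switch_probability(m, 0.5),
          math.sqrt(m) * switch_probability(m, 0.7))
```

For $p = \frac{1}{2}$ the rescaled value settles near
$(2\pi)^{-1/2}$, so no $\delta = o(m^{-1/2})$ can dominate it, while
for $p \neq \frac{1}{2}$ the same quantity decays exponentially in $m$.
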
The case of finite hypothesis spaces is then considered. In 
particular the authors show that fast rates are achieved in the 
presence of a unique risk minimizer. 

Theorem 2.6: Let $\mathcal{H}$ be a finite collection of classifiers,
and let $A$ be a learning algorithm which performs ERM over
$\mathcal{H}$. If there exists a unique risk minimizer
$h^* \in \mathcal{H}$, then $A$ is weakly
$(0, \delta)$-hypothesis-stable for $\delta = \exp(-\Omega(m))$.

The authors then conjecture that the analogue of Theorem 2.5 (3)
holds for general $|\mathcal{H}| < \infty$: [5, Conjecture 10.11]
predicts that under any distribution $D$ inducing multiple risk
minimizers, ERM is not $(0, \delta)$-training-stable for any
$\delta = o(m^{-1/2})$.

III. Finite Classes with Multiple Risk Minimizers 

Lemma 3.1: Let $\mathcal{H}$ be a finite hypothesis space with
$|\mathcal{H}| > 2$ and with risk minimizers satisfying
$|\mathcal{H}^*| = 2$. Then ERM on $\mathcal{H}$ with respect to the
0-1 loss, on a sample of $m$ examples, is not $(0, \delta)$-CV-stable
for any $\delta = o(m^{-1/2})$. Furthermore, ERM on $\mathcal{H}$ is
$(0, \delta)$-CV-stable for $\delta = O(m^{-1/2})$.

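Proof: Let $\mathcal{H}^* = \{h_1, h_2\}$. We proceed by first showing
that $f_S$ lies in $\mathcal{H}^*$ with probability approaching one
exponentially fast, and then that switching within $\mathcal{H}^*$
occurs often. Let
$\epsilon = \min_{h \in \mathcal{H} \setminus \mathcal{H}^*} R_D(h) - R_D(h_1)$,
which exists and is positive by the finite cardinality of
$\mathcal{H}$. Then by the union bound,
$R_D(h) \ge R_D(h_1) + \epsilon$ for all
$h \in \mathcal{H} \setminus \mathcal{H}^*$, and Chernoff's bound

\[
\begin{aligned}
\Pr(f_S \in \mathcal{H}^*)
&\ge \Pr\left(\forall h \in \mathcal{H} \setminus \mathcal{H}^*,\ R_S(h_1) < R_S(h)\right) \\
&\ge \Pr\left(\forall h \in \mathcal{H} \setminus \mathcal{H}^*,\ R_S(h_1) < R_D(h_1) + \epsilon/2 < R_S(h)\right) \\
&\ge 1 - \Pr\left(R_S(h_1) \ge R_D(h_1) + \epsilon/2\right)
   - \sum_{h \in \mathcal{H} \setminus \mathcal{H}^*} \Pr\left(R_D(h_1) + \epsilon/2 \ge R_S(h)\right) \\
&\ge 1 - \Pr\left(R_S(h_1) \ge R_D(h_1) + \epsilon/2\right)
   - \sum_{h \in \mathcal{H} \setminus \mathcal{H}^*} \Pr\left(R_D(h) - \epsilon/2 \ge R_S(h)\right) \\
&\ge 1 - \exp\left(-\epsilon^2 m / 2\right)
   - \sum_{h \in \mathcal{H} \setminus \mathcal{H}^*} \exp\left(-\epsilon^2 m / 2\right) \\
&= 1 - (|\mathcal{H}| - 1) \exp\left(-\epsilon^2 m / 2\right) . \qquad (1)
\end{aligned}
\]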
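
To get a sense of scale for bound (1), the following sketch evaluates
its right-hand side and inverts it for the sample size. The function
names and the example values of $|\mathcal{H}|$, $\epsilon$ and the
target probability are hypothetical; only the formula
$1 - (|\mathcal{H}| - 1)\exp(-\epsilon^2 m / 2)$ is taken from (1).

```python
import math


def prob_in_minimizers_lower_bound(num_hypotheses: int, eps: float, m: int) -> float:
    """Right-hand side of (1): 1 - (|H| - 1) exp(-eps^2 m / 2)."""
    return 1.0 - (num_hypotheses - 1) * math.exp(-eps ** 2 * m / 2.0)


def sample_size_for(num_hypotheses: int, eps: float, target: float) -> int:
    """Smallest m (up to floating-point rounding) with bound (1) >= target < 1."""
    m = math.ceil(2.0 * math.log((num_hypotheses - 1) / (1.0 - target)) / eps ** 2)
    return max(m, 1)


# Hypothetical setting: |H| = 10, risk gap eps = 0.1, target probability 0.99.
m_needed = sample_size_for(10, 0.1, 0.99)
print(m_needed, prob_in_minimizers_lower_bound(10, 0.1, m_needed))
```

The required sample size grows only logarithmically in
$|\mathcal{H}|$ but quadratically in $1/\epsilon$, which is why the
exponential term is negligible next to the $m^{-1/2}$ terms that follow.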

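Consider ERM on $\mathcal{H}^*$. Without loss of generality assume
that ERM on $\mathcal{H}^*$, when $h_1$ and $h_2$ have equal empirical
risk, selects $h_1$. Let
$\mathcal{Z}_1 = \{z \in \mathcal{Z} \mid \ell(h_1, z) < \ell(h_2, z)\}$
and $\mathcal{Z}_2$ be defined analogously. Let
$p = \Pr(Z \in \mathcal{Z}_1 \cup \mathcal{Z}_2)$, which is positive by
assumption that $h'(X) = h''(X)$ a.s. implies $h' = h''$. Since $h_1$
and $h_2$ have equal risk,
$\Pr(Z \in \mathcal{Z}_1) = \Pr(Z \in \mathcal{Z}_2) = p/2$. Then for
all $i \in [m]$

\[
\Pr_{D^2}\left((Z_i, U) \in (\mathcal{Z}_1 \times \mathcal{Z}_2) \cup (\mathcal{Z}_2 \times \mathcal{Z}_1)\right) = p^2 / 2 . \qquad (2)
\]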
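
Identity (2) can be checked exactly on a small finite distribution. The
distribution `D` and the classifiers `h1`, `h2` below are hypothetical,
chosen only so that both classifiers have equal risk; exact rational
arithmetic avoids any rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical distribution on z = (x, y); h1 and h2 both have risk 1/4.
D = {(0, 1): Fraction(1, 4), (0, -1): Fraction(1, 4), (1, 1): Fraction(1, 2)}


def h1(x):
    return 1


def h2(x):
    return -1 if x == 0 else 1


def loss(h, z):
    x, y = z
    return int(h(x) != y)


def risk(h):
    return sum(q * loss(h, z) for z, q in D.items())


Z1 = {z for z in D if loss(h1, z) < loss(h2, z)}   # only h2 errs
Z2 = {z for z in D if loss(h2, z) < loss(h1, z)}   # only h1 errs
p = sum(D[z] for z in Z1 | Z2)

# Probability that two independent draws fall in Z1 x Z2 or Z2 x Z1.
pair_prob = sum(D[z] * D[u] for z, u in product(D, D)
                if (z in Z1 and u in Z2) or (z in Z2 and u in Z1))
print(p, pair_prob, p ** 2 / 2)
```

Equal risk forces the two disagreement regions to carry mass $p/2$
each, which is where the factor $1/2$ in (2) comes from.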

For the moment, assume that for all $i \in [m]$

\[
\Pr\left( \left| R_{S^i}(h_1) - R_{S^i}(h_2) \right| \le (m-1)^{-1} \right) = \Theta(m^{-1/2}) . \qquad (3)
\]

Conditioned on the events of (2) and (3), the probability of ERM on
$S$ outputting a different hypothesis than on $S^{i,U}$ is at least
$1/2$. This fact can be proved by cases on
$|R_{S^i}(h_1) - R_{S^i}(h_2)|$, while assuming that
$(Z_i, U) \in (\mathcal{Z}_1 \times \mathcal{Z}_2) \cup (\mathcal{Z}_2 \times \mathcal{Z}_1)$.
If $R_{S^i}(h_1) = R_{S^i}(h_2)$, then $f_S \neq f_{S^{i,U}}$
trivially. If $|R_{S^i}(h_1) - R_{S^i}(h_2)| = (m-1)^{-1}$ then
$\{ |R_{S^{i,Z}}(h_1) - R_{S^{i,Z}}(h_2)| : Z \in \{Z_i, U\} \} = \{0, 2m^{-1}\}$
and it follows that
$|R_S(h_1) - R_S(h_2)| \neq |R_{S^{i,U}}(h_1) - R_{S^{i,U}}(h_2)|$;
and since by symmetry the probability that
$R_{S^{i,Z}}(h_2) < R_{S^{i,Z}}(h_1)$ conditioned on the corresponding
difference being $2m^{-1}$ is $1/2$, it follows that
$f_S \neq f_{S^{i,U}}$ with probability $1/2$. Thus we have shown that
on $\mathcal{H}^*$, for all $i \in [m]$

\[
\Pr_{(S,U) \sim D^{m+1}}\left( f_S \neq f_{S^{i,U}} \mid \mathcal{A} \cap \mathcal{B} \right) \ge 1/2 , \qquad (4)
\]

where

\[
\begin{aligned}
\mathcal{A} &= \left\{ (S, U) : (Z_i, U) \in (\mathcal{Z}_1 \times \mathcal{Z}_2) \cup (\mathcal{Z}_2 \times \mathcal{Z}_1) \right\} \\
\mathcal{B} &= \left\{ (S, U) : \left| R_{S^i}(h_1) - R_{S^i}(h_2) \right| \le (m-1)^{-1} \right\} .
\end{aligned}
\]
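
The case analysis behind (4) is small enough to enumerate. The sketch
below tracks the unnormalized difference $d$ between the number of
errors of $h_1$ and of $h_2$, with ties going to $h_1$ as assumed in
the proof. The function and variable names are hypothetical.

```python
from itertools import product


def erm_on_pair(d: int) -> int:
    """ERM over {h1, h2} given d = (#errors of h1) - (#errors of h2); ties -> h1."""
    return 1 if d <= 0 else 2


# Contribution of one example to d: examples in Z_1 are misclassified only by
# h2 (d decreases by 1), examples in Z_2 only by h1 (d increases by 1).
DELTA = {"Z1": -1, "Z2": +1}


def switches(d_rest: int, zi: str, u: str) -> bool:
    """True if ERM on S and on S^{i,U} differ, where d_rest is d computed on S^i."""
    return erm_on_pair(d_rest + DELTA[zi]) != erm_on_pair(d_rest + DELTA[u])


# Enumerate |d_rest| <= 1 with (Z_i, U) in Z_1 x Z_2 or Z_2 x Z_1.
for d_rest, (zi, u) in product((-1, 0, 1), (("Z1", "Z2"), ("Z2", "Z1"))):
    print(d_rest, zi, u, switches(d_rest, zi, u))
```

A tie on $S^i$ always produces a switch, while a difference of one
produces a switch for exactly one of the two signs of $d$; by the
symmetry between $h_1$ and $h_2$ each sign is equally likely, which
gives the conditional probability of at least $1/2$.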

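Notice that since $f_S, f_{S^{i,U}} \in \mathcal{H}^*$ and
$U \in \mathcal{Z}_1 \cup \mathcal{Z}_2$, $f_S \neq f_{S^{i,U}}$
implies that $\ell(f_S, U) \neq \ell(f_{S^{i,U}}, U)$. Then together
(1)-(4) lead to the following statement about ERM over $\mathcal{H}$:

\[
\begin{aligned}
\Pr_{(S,U) \sim D^{m+1}}\left( \left| \ell(f_S, U) - \ell(f_{S^{i,U}}, U) \right| > 0 \right)
&\ge \Pr\left( \left| \ell(f_S, U) - \ell(f_{S^{i,U}}, U) \right| > 0,\ f_S, f_{S^{i,U}} \in \mathcal{H}^* \right) \\
&\ge \Omega\left( p^2 m^{-1/2} \right) \left( 1 - 2(|\mathcal{H}| - 1) \exp\left(-\epsilon^2 m / 2\right) \right) \\
&= \Omega\left( m^{-1/2} \right) .
\end{aligned}
\]

And so ERM is not $(0, \delta)$-CV-stable for any
$\delta = o(m^{-1/2})$ as claimed. Furthermore ERM is
$(0, \delta)$-CV-stable for $\delta = O(m^{-1/2})$ since

\[
\begin{aligned}
\Pr_{(S,U) \sim D^{m+1}}\left( \left| \ell(f_S, U) - \ell(f_{S^{i,U}}, U) \right| > 0 \right)
&\le \Pr\left( \left| \ell(f_S, U) - \ell(f_{S^{i,U}}, U) \right| > 0,\ f_S, f_{S^{i,U}} \in \mathcal{H}^* \right) \\
&\quad + \Pr\left( f_S \notin \mathcal{H}^* \vee f_{S^{i,U}} \notin \mathcal{H}^* \right) \\
&\le O\left( m^{-1/2} \right) + 2(|\mathcal{H}| - 1) \exp\left(-\epsilon^2 m / 2\right) .
\end{aligned}
\]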

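All that remains is to verify (3). Consider
$X_{n,q} \sim \mathrm{Bin}(n, q)$ for $n \in \mathbb{N}$,
$q \in [0, 1]$ and note that for $k \in \mathbb{N} \cup \{0\}$

\[
\Pr\left( \left| X_{k,1/2} - \frac{k}{2} \right| \le \frac{1}{2} \right)
= \begin{cases}
\Pr\left( X_{k,1/2} = \frac{k}{2} \right) , & k \text{ even} \\
2 \Pr\left( X_{k,1/2} = \frac{k-1}{2} \right) , & k \text{ odd}
\end{cases}
\ \ge\ \Pr\left( X_{k,1/2} = \lfloor k/2 \rfloor \right) .
\]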
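
The central binomial probabilities above, and the Stirling asymptotics
they feed into, can be checked numerically. The sketch below computes
$\Pr(|X_{k,1/2} - k/2| \le 1/2)$ exactly and compares it with
$\sqrt{2/(\pi k)}$, the standard Stirling estimate of
$\Pr(X_{k,1/2} = k/2)$ for even $k$. The function names are
hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math
from fractions import Fraction


def central_prob(k: int) -> Fraction:
    """Exact Pr(|X - k/2| <= 1/2) for X ~ Bin(k, 1/2)."""
    if k % 2 == 0:
        return Fraction(math.comb(k, k // 2), 2 ** k)
    return Fraction(2 * math.comb(k, (k - 1) // 2), 2 ** k)


def stirling_estimate(k: int) -> float:
    """sqrt(2 / (pi k)): Stirling's approximation to Pr(X = k/2) for even k."""
    return math.sqrt(2.0 / (math.pi * k))


for k in (10, 100, 1000):
    print(k, float(central_prob(k)), stirling_estimate(k))
```

For even $k$ the exact value and the estimate agree to within a
relative error of order $1/k$, which is the $k^{-1/2}$ behaviour used
to establish (3).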



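Splitting on
$\sum_{j \in [m] \setminus \{i\}} \mathbf{1}[Z_j \in \mathcal{Z}_1 \cup \mathcal{Z}_2]$
(the $\mathrm{Bin}(m-1, p)$ number of examples in $S^i$ on which
$h_1$, $h_2$ disagree) and noting that when conditioned on this sum
being equal to $k$, the random variable
$\frac{m-1}{2}\left( R_{S^i}(h_1) - R_{S^i}(h_2) \right) + \frac{k}{2} \sim \mathrm{Bin}(k, 1/2)$,
leads to the lower-bound

\[
\begin{aligned}
\Pr\left( \left| R_{S^i}(h_1) - R_{S^i}(h_2) \right| \le (m-1)^{-1} \right)
&= \sum_{k=0}^{m-1} \Pr\left( X_{m-1,p} = k \right) \Pr\left( \left| X_{k,1/2} - \frac{k}{2} \right| \le \frac{1}{2} \right) \\
&\ge \min_{0 \le k \le m-1} \Pr\left( \left| X_{k,1/2} - \frac{k}{2} \right| \le \frac{1}{2} \right) \sum_{j=0}^{m-1} \Pr\left( X_{m-1,p} = j \right) \\
&= \min_{0 \le k \le m-1} \Pr\left( \left| X_{k,1/2} - \frac{k}{2} \right| \le \frac{1}{2} \right) \\
&\ge \Pr\left( X_{m-1,1/2} = \lfloor (m-1)/2 \rfloor \right) \\
&= \Omega\left( m^{-1/2} \right) ,
\end{aligned}
\]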

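where the last relation is a consequence of Stirling's approximation.
Let $c = \lceil p(m-1)/2 \rceil$. The upper-bound follows similarly:

\[
\begin{aligned}
\Pr\left( \left| R_{S^i}(h_1) - R_{S^i}(h_2) \right| \le (m-1)^{-1} \right)
&= \sum_{k=0}^{c} \Pr\left( X_{m-1,p} = k \right) \Pr\left( \left| X_{k,1/2} - \frac{k}{2} \right| \le \frac{1}{2} \right) \\
&\quad + \sum_{k=c+1}^{m-1} \Pr\left( X_{m-1,p} = k \right) \Pr\left( \left| X_{k,1/2} - \frac{k}{2} \right| \le \frac{1}{2} \right) \\
&\le \Pr\left( X_{m-1,p} \le c \right) + \sup_{k > c} \Pr\left( \left| X_{k,1/2} - \frac{k}{2} \right| \le \frac{1}{2} \right) \\
&\le e^{-\Omega(pm)} + O\left( m^{-1/2} \right) \\
&= O\left( m^{-1/2} \right) ,
\end{aligned}
\]

where the penultimate relation follows from an application of
Chernoff's inequality to the first term and Stirling's approximation
to the second. □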

Theorem 3.2: Let $\mathcal{H}$ be a finite hypothesis space with
$|\mathcal{H}| > 2$ and with risk minimizers $|\mathcal{H}^*| > 1$.
Then any learning algorithm $A$ implementing ERM on $\mathcal{H}$ with
respect to the 0-1 loss, on a sample of $m$ examples, is not
$(0, \delta)$-CV-stable for any $\delta = o(m^{-1/2})$. Furthermore,
$A$ is $(0, \delta)$-CV-stable for $\delta = O(m^{-1/2})$.

Proof: Define $\epsilon$ as before. Arbitrarily order the risk
minimizers $\mathcal{H}^* = \{h_1, \ldots, h_n\}$ where
$n = |\mathcal{H}^*|$, and let
$p_{jk} = \Pr(h_j(X) \neq h_k(X)) > 0$ by assumption that
$h'(X) = h''(X)$ a.s. implies $h' = h''$ for each
$h', h'' \in \mathcal{H}$. Let $f^{jk}_T$ be the result of running $A$
on sample $T$ in the hypothesis space $\{h_j, h_k\}$, where ties
between empirical risk minimizers are broken as in ERM over
$\mathcal{H}$. Let $f^*_T$ be the result of running $A$ on $T$ in
$\mathcal{H}^*$. It follows that for all $\{j, k\} \subseteq [n]$,
conditional upon



event 



\fs ~ hji foi 



f J s -,u = f* s .,u = h k , and so by (g-g 



' Js 



fs 



hj and 



= Pr 



> o 















fc, 3 2 




> 



> O [ min p 2 i,m 1 ^ 2 

V{j,fc}C[n] Jfe 

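It follows that

\[
\begin{aligned}
\Pr_{(S,U) \sim D^{m+1}}\left( \left| \ell(f^*_S, U) - \ell(f^*_{S^{i,U}}, U) \right| > 0 \right)
&\ge \min_{\{j,k\} \subseteq [n]} \Pr\left( \left| \ell(f^{jk}_S, U) - \ell(f^{jk}_{S^{i,U}}, U) \right| > 0 \right) \\
&= \Omega\left( \min_{\{j,k\} \subseteq [n]} p_{jk}^2 \, m^{-1/2} \right) . \qquad (6)
\end{aligned}
\]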



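Observing that
$\Pr(f_S \in \mathcal{H}^*) \ge 1 - |\mathcal{H}| \exp(-\epsilon^2 m / 2)$
by the same argument that led to (1), we have as before

\[
\begin{aligned}
\Pr\left( \left| \ell(f_S, U) - \ell(f_{S^{i,U}}, U) \right| > 0 \right)
&\ge \Pr\left( \left| \ell(f_S, U) - \ell(f_{S^{i,U}}, U) \right| > 0,\ f_S, f_{S^{i,U}} \in \mathcal{H}^* \right) \\
&\ge \Pr\left( \left| \ell(f^*_S, U) - \ell(f^*_{S^{i,U}}, U) \right| > 0 \right)
   \times \left( 1 - \Pr\left( f_S \notin \mathcal{H}^* \vee f_{S^{i,U}} \notin \mathcal{H}^* \right) \right) \\
&= \Omega\left( \min_{\{j,k\} \subseteq [n]} p_{jk}^2 \, m^{-1/2} \right) \left( 1 - 2 |\mathcal{H}| \exp\left(-\epsilon^2 m / 2\right) \right) . \qquad (7)
\end{aligned}
\]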



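This implies that ERM on $\mathcal{H}$ is not $(0, \delta)$-CV-stable
for any $\delta = o(m^{-1/2})$. Similarly we derive an upper-bound
analogous to (7) by the same technique used in the proof of Lemma 3.1

\[
\begin{aligned}
\Pr\left( \left| \ell(f_S, U) - \ell(f_{S^{i,U}}, U) \right| > 0 \right)
&\le \Pr\left( \left| \ell(f_{S^{i,U}}, U) - \ell(f_S, U) \right| > 0,\ f_S, f_{S^{i,U}} \in \mathcal{H}^* \right) \\
&\quad + \Pr\left( f_S \notin \mathcal{H}^* \vee f_{S^{i,U}} \notin \mathcal{H}^* \right) \\
&= O\left( \max_{\{j,k\} \subseteq [n]} p_{jk}^2 \, m^{-1/2} \right) + 2 |\mathcal{H}| \exp\left(-\epsilon^2 m / 2\right) .
\end{aligned}
\]

This implies that ERM on $\mathcal{H}$ is $(0, \delta)$-CV-stable for
$\delta = O(m^{-1/2})$. □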